
Background

E-commerce has grown dramatically in recent years, and many brick-and-mortar retail corporations (e.g. Walmart, Target) have entered the e-commerce competition. The goal of this project is to use a dataset provided by Walmart to forecast weekly sales across 45 stores based on a set of features (e.g. store size, unemployment rate, Consumer Price Index (CPI)). The first part of the analysis reflects the past and current financial condition of the corporation, for either a single store or a particular department. It also explores how each feature affects the business outcome, which is crucial information for corporate operations. In the second part, predictive time-series models are implemented to forecast future sales on a weekly basis. This step is indispensable for a company that wants an estimate of the future direction of its profit, or even its stock price, and it helps the management team prepare a business strategy for the next period.

Client / Audience

The primary audience for this analysis is Walmart's management team and executives; the findings can inform their operational format and business plans. A second audience is professionals who follow Walmart's business condition; the findings offer insight into current and future sales at the brick-and-mortar stores. A third group is anyone interested in the sales patterns of a Walmart store or a particular department; the analysis also shows how seasonal factors change sales, which is valuable for professionals and merchants.

Questions to be addressed

- Discover the sales distribution across stores and individual departments.

- Discover the correlation between the features and weekly sales.

- Discover how each feature contributes to the predictive models.

- Determine the best model(s) for forecasting weekly sales (continuous values).

- Examine how the models perform across different states.

Dataset Exploration

Summary

  • There are 421,570 instances in this dataset, covering the sales results of 45 retail stores. Each store has about 99 departments.
In [1]:
import pandas as pd
import numpy as np

trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")


df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)

#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
       'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
       'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']

display(train.head(3))
Dataset size: (421570, 16)
Store Dept Date Weekly_Sales IsHoliday Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Store_Type Size
0 1 1 2010-02-05 24924.50 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
1 1 2 2010-02-05 50605.27 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
2 1 3 2010-02-05 13740.12 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315

Missing Values

  • Before getting into the analysis, we first deal with the missing values. As shown below, the missing values are concentrated in five features: MarkDown1, MarkDown2, MarkDown3, MarkDown4 and MarkDown5; no missing values are found in the other features. For this reason, we dig deeper into those features to check whether the missing values follow a pattern consistent enough for imputation.

  • The following shows the proportion of missing values for each store. Based on Table 1, the proportion of missing values among these five features ranges from 63% to 90%. Given these high volumes, we dig further to check whether the values are missing at random. Table 2 shows that the date range WITHOUT missing values runs from 2011-11-11 to 2012-10-26, while Table 3 shows that the range WITH missing values runs from 2010-02-05 to 2011-11-04. The missing values are therefore clearly NOT missing at random. Considering both the volume and the pattern, I propose dropping the five features (i.e. columns) rather than the records (i.e. rows), because keeping as many records as possible is important for the time-series analysis.
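Table 1 below computes the per-store proportions with a loop; a quicker column-level version of the same check can be sketched with pandas' `isnull().mean()`. The frame and the 60% drop threshold here are synthetic stand-ins, not the actual Walmart data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged training table
demo = pd.DataFrame({
    "Weekly_Sales": [100.0, 200.0, 150.0, 300.0],
    "MarkDown1":    [np.nan, np.nan, np.nan, 5.0],
})

# Fraction of missing values per column
miss = demo.isnull().mean()

# Columns above a (hypothetical) missingness threshold are drop candidates
to_drop = miss[miss > 0.6].index.tolist()
print(to_drop)  # ['MarkDown1'] -- 75% missing
```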

In [5]:
cntLst = []
print("Table 1")
print('''Proportion of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''')
print()
cnt=0
for store in train.Store.unique():
    for var in ["MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"]:
        cntLst.append(train[var][train.Store == store].isnull().sum())
    if cnt<=5:
        print("Store #%d"%(store), end="")
        print(" with %d instance has missing values"%(train[train.Store == store].shape[0]), end=" ")
        print(cntLst, end=" ")
        print("in percent", end=" ")
        print(np.round(np.array(cntLst) / train[train.Store == store].shape[0], 2)*100, end="\n")
        cnt+=1
    elif cnt<=6:
        print("...")
        cnt+=1
    else:
        pass
    cntLst=[]
Table 1
Proportion of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store #1 with 10244 instance has missing values [6587, 7229, 6656, 6587, 6587] in percent [64. 71. 65. 64. 64.]
Store #2 with 10238 instance has missing values [6575, 7218, 6646, 6575, 6575] in percent [64. 71. 65. 64. 64.]
Store #3 with 9036 instance has missing values [5791, 6554, 6230, 5922, 5791] in percent [64. 73. 69. 66. 64.]
Store #4 with 10272 instance has missing values [6596, 7171, 6739, 6668, 6596] in percent [64. 70. 66. 65. 64.]
Store #5 with 8999 instance has missing values [5776, 6660, 6279, 5906, 5776] in percent [64. 74. 70. 66. 64.]
Store #6 with 10211 instance has missing values [6559, 7057, 6702, 6559, 6559] in percent [64. 69. 66. 64. 64.]
...
In [7]:
cntLst = []
print("Table 2")
print('''Date range WITHOUT Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
cnt=0
for store in train.Store.unique():
    arr = sorted(pd.to_datetime(train.Date[train.MarkDown1.isnull() == False][train.Store == store].unique()))
    if cnt<=5:
        print("Store %d %s - %s"%(store,min(arr),max(arr)))
        cnt+=1
    elif cnt<=6:
        print("...")
        cnt+=1
    else:
        pass
Table 2
Date range WITHOUT Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store 1 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 2 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 3 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 4 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 5 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 6 2011-11-11 00:00:00 - 2012-10-26 00:00:00
...
In [8]:
cntLst = []
print("Table 3")
print('''Date range of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
cnt=0
for store in train.Store.unique():
    arr = sorted(pd.to_datetime(train.Date[train.MarkDown1.isnull() == True][train.Store == store].unique()))
    if cnt<=5:
        print("Store %d %s - %s"%(store,min(arr),max(arr)))
        cnt+=1
    elif cnt<=6:
        print("...")
        cnt+=1
    else:
        pass
Table 3
Date range of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store 1 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 2 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 3 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 4 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 5 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 6 2010-02-05 00:00:00 - 2011-11-04 00:00:00
...

Selected Features

  • IsHoliday (binary)
  • Temperature (continuous)
  • Fuel_Price (continuous)
  • Consumer Price Index, CPI (continuous)
  • Store_Type (categorical: A, B, C)
  • Unemployment Rate (continuous)

Categorical Features

  • This dataset contains two categorical features, "IsHoliday" and "Store_Type". To proceed with the time-series analysis, these categorical attributes are one-hot encoded (dummied) into individual variables, as in Table 4. As you can see, the binary "IsHoliday" feature is split into 2 columns (i.e. True, False) and "Store_Type" into 3 columns (i.e. A, B, C).
In [5]:
print("Table 4")
print('''Split the feature "Store_Type" into A,B,C and "IsHoliday" into True,False''')
#display(pd.get_dummies(train[["IsHoliday_x"]]))
pd.concat((pd.get_dummies(train["Store_Type"]),pd.get_dummies(train["IsHoliday"])), axis=1).head(5)
Table 4
Split the feature "Store_Type" into A,B,C and "IsHoliday" into True,False
Out[5]:
A B C False True
0 1 0 0 1 0
1 1 0 0 1 0
2 1 0 0 1 0
3 1 0 0 1 0
4 1 0 0 1 0

Outliers

  • Weekly_Sales: Due to geographical and holiday factors, the sales of each store are expected to vary. Referring to Figure 1, the weekly sales not only vary quite a bit, but a considerably large number of data points also fall outside the typical range. As mentioned, this variation might be caused by a mix of factors, so I leave those "outliers" as is for now.
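The "typical range" flagged by a box plot is the interquartile range extended by 1.5×IQR whiskers; a minimal sketch of that rule (on made-up numbers, not the actual sales) would be:

```python
import numpy as np

# Synthetic weekly sales (in thousands) with one extreme week
sales = np.array([10, 12, 11, 13, 12, 14, 11, 95], dtype=float)

q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whiskers are what the box plot marks as outliers
outliers = sales[(sales < lo) | (sales > hi)]
print(outliers)  # [95.]
```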
In [9]:
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline

init_notebook_mode(connected=True)

namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Weekly_Sales[train.Store == int(name)]/1000,2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Sales Values by each store (Figure 1)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Weekly Sales (in thousands)",
                 titlefont=dict(size=18),
                 zeroline=True, range=[-10,200], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)

Outliers

  • Temperature: Referring to Figure 2, the "Temperature" variable looks very consistent across the stores, and there are NO outliers significantly outside the typical range for any store.
In [7]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Temperature[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Temperature Values by each store (Figure 2)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Temperature(F)",
                 titlefont=dict(size=18),
                 zeroline=True, range=[-5,120], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)

Outliers

  • Fuel Price: Referring to Figure 3, the "Fuel_Price" feature is even more consistent than "Temperature", ranging between 2.5 and 4.5 across the stores, so I believe there are NO outliers in this feature. This makes sense because fuel has a market price across the country, so little variation is expected within each geographical area.
In [8]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Fuel_Price[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Fuel Price Values by each store (Figure 3)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,
                ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Fuel Price($)",
                 titlefont=dict(size=18),
                 zeroline=True, 
                 range=[2,5], 
                 showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)

Outliers

  • Consumer Price Index (CPI): According to the U.S. Bureau of Labor Statistics, the CPI is "a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services". Referring to Figure 4, the CPI value ranges from about 120 to 230 and differs quite a lot among the stores, yet NO outliers appear in this dataset. The figure gives a relative indication of purchasing power across stores. As a working hypothesis, an area with a higher CPI should show more purchasing power and therefore generate more sales for the store, and vice versa. Further analysis will measure this aspect of the feature as the analysis proceeds.
In [9]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.CPI[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of CPI Values by each store (Figure 4)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="CPI",
                 titlefont=dict(size=18),
                 zeroline=True, range=[100,250], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)

Outliers

  • Unemployment: The unemployment rate is another index of purchasing power. Referring to Figure 5, most of the values fall between 6 and 8, without outliers, across the stores. As with "CPI", further analysis will measure how this feature influences weekly sales across the stores.
In [15]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Unemployment[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Unemployment Rate by each store (Figure 5)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Unemployment Rate",
                 titlefont=dict(size=18),
                 zeroline=True, range=[0,20], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)

Data Analysis


Weekly Sales Summary

Let's first take a close look at the trend of weekly sales over this period. The sales plot captures the trend from Feb 2010 to Oct 2012. As we can see, weekly sales per store range approximately between 0 and 4 million dollars. The trend for each store stays flat over time except during the two holiday seasons, which run approximately from Nov 20 to Dec 31 in both years. This period covers two big holidays, Thanksgiving and Christmas, when retailers traditionally offer huge discounts. During these high-selling periods, sales spike to 1.5–2 times their usual level, so we ought to pay extra attention to predictions for the holiday period.

In [13]:
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline

init_notebook_mode(connected=True)
In [11]:
#Reload Data

import pandas as pd
import numpy as np

trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")


df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)

#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
       'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
       'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']
Dataset size: (421570, 16)
In [12]:
# Dropping Features
train.drop(columns=["MarkDown1","MarkDown2","MarkDown3","MarkDown4", "MarkDown5"], inplace = True)
In [14]:
#Gather Weekly Sales for each store
namesLst=list(map(str,train.Store.unique()))
dateLst = train.Date.unique()

aa=[]
salesLst=[]
timeLst=[]
storeSalesDict={}
for name in namesLst:
    #print(name)
    storeSalesDict[int(name)]=[]
    for date in dateLst:
        aa=round(train.Weekly_Sales[train.Store == int(name)][train.Date == date].sum()/ 1000000,2)
        storeSalesDict[int(name)].append(aa)
In [15]:
traces=[]

for i, sales in storeSalesDict.items():
    #print(i, sales)
    traces.append(go.Scatter(
    x= pd.to_datetime(dateLst),
    y=sales,
    name="Store#"+str(i)
    ))
In [16]:
import numpy as np
aa= np.zeros((len(storeSalesDict),len(storeSalesDict)))
aa = [[str(aa[i][j]) for i in range(0,aa.shape[0])] for j in range(0,aa.shape[1])]
aa = [[False for i in range(0,len(aa))] for j in range(0,len(aa))]
for i in range(len(aa)):
    aa[i][i] = True
In [17]:
updatemenus = list([
    dict(type="buttons",
         active=-1,
         buttons=list([
            dict(label = 'Store 1',
                 method = 'update',
                 args = [{'visible': aa[0]},
                         {'title': 'Store 1 Sales'}])]))])
In [20]:
layout = go.Layout(
    title = 'Weekly Sales (Figure 6)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=True,
    #width=700,
    #height=1000,
    yaxis =dict(title = "Weekly Sales (in Millions)", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Time Series (in Months)",
                 titlefont=dict(size=18),
                 exponentformat='e',
                 showexponent='all',
                 zeroline=True,  
                 showgrid=False,
                 rangeselector=dict(visible=True, 
                                    buttons=list([
                                        dict(count=1, 
                                             label="1m", 
                                             step="month", 
                                             stepmode ="backward"),
                                        dict(count=3, 
                                                 label="3m", 
                                                 step="month", 
                                                 stepmode ="backward"),
                                         dict(count=6, 
                                                 label="6m", 
                                                 step="month", 
                                                 stepmode ="backward"),
                                         dict(count=12, 
                                                 label="12m", 
                                                 step="month", 
                                                 stepmode ="backward"),
                                        dict(step ="all")])),
                 rangeslider = dict(visible=True)
                ),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False,
    updatemenus=updatemenus
)
In [21]:
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)

Weekly Sales in Store_Type

Figure 7 below shows the total sales generated by each store type (i.e. A, B, C). Unsurprisingly, regular (non-holiday) weeks account for the major share of total sales, because holidays such as Christmas and Thanksgiving make up only a small portion of the year. However, in Figure 8 we can observe that average weekly sales during holiday weeks are higher than during regular weeks for Type A and B stores, while sales are essentially unchanged for Type C stores. To determine the difference in sales performance between holidays and non-holidays more precisely, I propose setting up a hypothesis test for further verification.
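One way to set up that test is a two-sample t-test comparing holiday against non-holiday weekly sales. Below is a sketch using `scipy.stats.ttest_ind` on synthetic stand-in numbers; the real test would use the weekly totals from `train`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins: holiday weeks drawn with a higher mean (in $M)
holiday_sales    = rng.normal(loc=1.6, scale=0.2, size=20)
nonholiday_sales = rng.normal(loc=1.0, scale=0.2, size=120)

# Welch's t-test: no equal-variance assumption between the two groups
t_stat, p_val = stats.ttest_ind(holiday_sales, nonholiday_sales,
                                equal_var=False)
print(t_stat > 0, p_val < 0.05)
```

If the p-value falls below the chosen significance level (e.g. 0.05), we reject the null hypothesis that holiday and non-holiday mean weekly sales are equal.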

In [23]:
trace1 = []

salesA = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False].sum()
salesB = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==False].sum()
salesC = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==False].sum()

trace1 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA, salesB, salesC],
    name = "Working day"
)

salesA = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==True].sum()
salesB = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==True].sum()
salesC = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==True].sum()

trace2=[]
trace2 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA, salesB, salesC],
    name = "Holiday"
)

data = [trace1, trace2]
layout = go.Layout(
    title = "Store Type Sale (%s to %s) - Total (Figure 7) "% (min(train.Date), max(train.Date)),
    barmode='group',
    showlegend=True,
    xaxis = dict(title = "Store Type"),
    yaxis = dict(title = " Total Sales")
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)
In [25]:
#salesA_len= len(list(train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False]))
salesA_len = len(list(set(train.Date[train.Store_Type=="A"][train.IsHoliday==False])))
salesA_tot = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False].sum()
salesA_avg = salesA_tot/ salesA_len

salesB_len= len(list(set(train.Date[train.Store_Type=="B"][train.IsHoliday==False])))
salesB_tot = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==False].sum()
salesB_avg = salesB_tot/ salesB_len

salesC_len= len(list(set(train.Date[train.Store_Type=="C"][train.IsHoliday==False])))
salesC_tot = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==False].sum()
salesC_avg = salesC_tot/ salesC_len

trace1 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA_avg, salesB_avg, salesC_avg],
    name = "Working day - " + str(salesA_len) + " days"
)

salesA_len= len(list(set(train.Date[train.Store_Type=="A"][train.IsHoliday==True])))
salesA_tot = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==True].sum()
salesA_avg = salesA_tot/ salesA_len

salesB_len= len(list(set(train.Date[train.Store_Type=="B"][train.IsHoliday==True])))
salesB_tot = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==True].sum()
salesB_avg = salesB_tot/ salesB_len

salesC_len= len(list(set(train.Date[train.Store_Type=="C"][train.IsHoliday==True])))
salesC_tot = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==True].sum()
salesC_avg = salesC_tot/ salesC_len

trace2 = go.Bar(
    x = sorted(list(set(train.Store_Type))),
    y = [salesA_avg, salesB_avg, salesC_avg],
    name = "Holiday - "+str(salesA_len) + " days"
)

data = [trace1, trace2]
layout = go.Layout(
    title = "Store Type Sale (%s to %s) - Weekly (Figure 8) "% (min(train.Date), max(train.Date)),
    barmode='group',
    showlegend=True,
    xaxis = dict(title = "Store Type"),
    yaxis = dict(title = " Weekly Sales ")
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)

Weekly Sales in Department

Figure 9 below ranks the departments by average weekly sales. The highest average weekly sales, about $75,000, are generated by Dept 92; the lowest, about -$7.68, come from Dept 47. This plot demonstrates the sales disparity between departments.

In [27]:
sales_dept = []
dept_lst=[]
lst=[]
for dept in list(set(train.Dept)):
    lst.append(tuple((dept,round(train.Weekly_Sales[train.Dept == dept].sum()/len(train.Weekly_Sales[train.Dept == dept]),2))))

df = pd.DataFrame(lst).sort_values(by=[1], ascending=True)
dept_lst = list(df.iloc[:,0])
sales_dept = list(df.iloc[:,1])
    
trace3 = []
trace3 = go.Bar(
    x = list(df.iloc[:,1]),
    y = list(map(str,list(df.iloc[:,0]))),
    orientation="h"
    #name = "Working day"
)

data = [trace3]
layout = go.Layout(
    #barmode='group',
    width=700,
    height=1200, 
    #showlegend=True
    title = 'Average Weekly Sales Ranking (Figure 9)',
    yaxis =dict(title = "Department Number", 
                #exponentformat='e',
                #showexponent='all',
                titlefont=dict(size=18),
                tick0=5,
                ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=False,
                type='category'),
    xaxis = dict(title="Weekly Sales($)",
                 titlefont=dict(size=18),
                 zeroline=True, 
                 #range=[2,5], 
                 showgrid=True)
)

annotations=[]
for i in range(len(dept_lst)):
    annotations.append(dict(x=sales_dept[i], 
                            y=dept_lst[i],
                            text="%0.2f"%(sales_dept[i]),
                            font=dict(family='Arial', 
                                      size=12,
                                      color='red'),
                            showarrow=True,
                            align = "center",
                            ax=40,
                            ay=0,
                            arrowhead=0))
    layout['annotations'] = annotations


fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)

Correlation with the Target Variable

The following scatter plots roughly show the correlations between the predictor variables and the target variable.

  • Weekly_Sales vs Temperature
  • Weekly_Sales vs Fuel_Price
  • Weekly_Sales vs CPI
  • Weekly_Sales vs Size

Unfortunately, it is difficult to judge the strength of the correlation between these features and weekly sales, with the exception of store size (a larger store size seems to lead to higher weekly sales). Hypothesis tests will be performed afterward for further verification.
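Those tests can be sketched with `scipy.stats.pearsonr`, which returns both the sample correlation and the p-value for the null hypothesis of zero correlation. The data below is synthetic (a made-up linear relation between store size and sales), not the actual merged table:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical store sizes and sales that scale with size plus noise
size = rng.uniform(30_000, 220_000, 200)
weekly_sales = 5.0 * size + rng.normal(0, 200_000, 200)

# r is the Pearson correlation; p tests H0: true correlation is zero
r, p = stats.pearsonr(size, weekly_sales)
print(r > 0.5, p < 0.001)  # strong positive correlation here
```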

In [ ]:
'''*************************No need to Run*****************'''
wsLst=[]
tempLst=[]
fpLst=[]
cpiLst=[]
sizeLst=[]
storeLst=[]
dateLst=[]

for date in list(set(train.Date)):
    print(date)
    for store in list(set(train.Store)):
        storeLst.append(store)
        dateLst.append(date)
        wsLst.append((train.Weekly_Sales[train.Store == store][train.Date == date]).sum())
        tempLst.append(list(set(train.Temperature[train.Store == store][train.Date == date]))[0])
        fpLst.append(list(set(train.Fuel_Price[train.Store == store][train.Date == date]))[0])
        cpiLst.append(list(set(train.CPI[train.Store == store][train.Date == date]))[0])
        sizeLst.append(list(set(train.Size[train.Store == store][train.Date == date]))[0])
        
pd.DataFrame(list(zip(storeLst,dateLst,wsLst, tempLst, fpLst, cpiLst, sizeLst)), 
             columns= ["store","date","Weekly_Sales", "Temperture", "Fuel_Price", "CPI", "Size"]).to_csv("Weekly_Sales_Corr2.csv")        
In [29]:
'''****************Read the Weekly Sales Correlated Feature Table************'''
df = pd.read_csv("Weekly_Sales_Corr2.csv", )
ws=list(df.Weekly_Sales)
temp=list(df.Temperture)
fp=list(df.Fuel_Price)
cpi=list(df.CPI)
size=list(df.Size)
#df.head(5,)
In [31]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from numpy import arange,array,ones
from scipy import stats
import seaborn as sns


plt.figure(figsize=(10,10))
sns.regplot(x="Fuel_Price", y="Weekly_Sales", data=df)
plt.title("Fuel Price vs Weekly_Sales", fontsize=20)
plt.show()
/Users/KevQuant/anaconda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning:

Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.

In [34]:
plt.figure(figsize=(10,10))
sns.regplot(x="Temperture", y="Weekly_Sales", data=df)
plt.title("Temperature vs Weekly_Sales", fontsize=20)
plt.show()

In [32]:
plt.figure(figsize=(10,10))
sns.regplot(x="CPI", y="Weekly_Sales", data=df)
plt.title("CPI vs Weekly_Sales", fontsize=20)
plt.show()

In [33]:
plt.figure(figsize=(10,10))
sns.regplot(x="Size", y="Weekly_Sales", data=df)
plt.title("Store Size vs Weekly_Sales", fontsize=20)
plt.show()

Heatmap

Dependent vs Independent Variables

The correlation plots above hint that store size (independent variable) correlates most strongly with store Weekly_Sales (dependent variable). To verify this, the following heatmap shows the correlation values numerically. The values confirm that store size is the only feature with a correlation above 0.5; the remaining features show very little correlation with the dependent variable Weekly_Sales. Given this, I suggest comparing the predictions of models built with and without the least-correlated features. Hypothesis tests will be run to verify these results.

Independent vs Independent Variables

The independent variables also show little correlation with one another, so I would not remove any of the features at this point. As above, hypothesis tests will be run to verify this result.

In [39]:
import matplotlib.pyplot as plt
import seaborn as sns

df2 = df.copy()
df2.drop(columns=["Unnamed: 0","store","date"], inplace=True)

df2_corr =df2.corr()
sns.heatmap(df2_corr, linewidths=0.5, annot=True)
plt.xticks(rotation=45)
plt.title("Correlation Heatmap", fontsize=15)
plt.show()

Hypothesis Test: (Dependent vs Independent Variables)


  • Null Hypothesis H0: The features (i.e. Temperature, Fuel_Price, CPI, Store Size) do NOT correlate with Store Weekly Sales.
  • Alt. Hypothesis Ha: The features correlate with Store Weekly Sales.

According to the hypothesis tests below, Temperature, CPI and Store Size each yield |Z| > 3 (Z = -5.1 for Temperature, Z = -5.8 for CPI, Z = 65.6 for Store Size): the observed correlation lies more than 3 standard deviations from the mean of the permutation distribution, which under a normal approximation corresponds to a p-value below 0.003 (99.7% of permuted correlations fall within ±3 standard deviations). We therefore have strong evidence to reject the Null Hypothesis for every feature except Fuel_Price (Z = 0.75): all of the other features correlate with Store Weekly Sales. This aligns with the correlation values in the heatmap above, e.g. the correlation of Fuel_Price vs Weekly_Sales is only 0.0095. As mentioned, I would not remove any features in the initial machine-learning stage even though some correlate only weakly with Weekly_Sales (the dependent variable); instead, we can compare the performance of models built with and without the least-correlated features.
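The |Z| > 3 cutoff above can be restated as a two-sided p-value under a normal approximation of the permutation distribution. A minimal stdlib sketch of that conversion, using the z-scores reported by the tests in this section (the helper name `z_to_p` is my own, not from the notebook):

```python
import math

def z_to_p(z):
    # Two-sided p-value under a standard-normal approximation:
    # p = P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# z-scores reported by the permutation tests in this section
for name, z in [("Temperature", -5.13), ("Fuel_Price", 0.75),
                ("CPI", -5.83), ("Store Size", 65.62)]:
    print(f"{name}: z = {z:+.2f}, two-sided p ~ {z_to_p(z):.2e}")
```

Fuel_Price is the only feature whose p-value stays large (around 0.45), matching the fail-to-reject conclusion above.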

In [40]:
# Pearson correlation coefficient of two arrays
def pearson_r(x,y):
    corr_mat = np.corrcoef(x,y)
    return corr_mat[0,1]

# Z-score of `value` relative to the permutation distribution `array`
# (despite the name, this returns a z-score, not a p-value)
def p_value(array, value):
    return (value - np.mean(array))/np.std(array)
In [41]:
xi = temp
yi = ws

No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs Weekly Sales is", p_val[0,1])
The Z value for Temperature vs Weekly Sales is -5.12599346608834
In [42]:
xi = fp
yi = ws

No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Fuel_Price vs Weekly Sales is", p_val[0,1])
The Z value for Fuel_Price vs Weekly Sales is 0.7487650321211525
In [43]:
xi = cpi
yi = ws

No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for CPI vs Weekly Sales is", p_val[0,1])
The Z value for CPI vs Weekly Sales is -5.831720613702725
In [44]:
xi = size
yi = ws

No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Weekly Sales is", p_val[0,1])
The Z value for Store Size vs Weekly Sales is 65.6174496224247

Hypothesis Test: (Store Size vs Other Independent Variables)


  • Null Hypothesis H0: Store Size does NOT correlate with the other features (i.e. Temperature, Fuel_Price, CPI).
  • Alt. Hypothesis Ha: Store Size correlates with the other features.

According to the tests below, only Store Size vs Temperature yields |Z| > 3 (Z = -7.5), so for that pair we reject the Null Hypothesis and conclude that Store Size correlates with Temperature. For CPI (Z = -0.78) and Fuel Price (Z = 0.73), |Z| < 3, so we fail to reject the null: Store Size shows no significant correlation with those two features. This aligns with the correlation values in the heatmap above.

In [45]:
xi = size
yi = cpi

No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs CPI is", p_val[0,1])
The Z value for Store Size vs CPI is -0.7824637189185721
In [46]:
xi = size
yi = fp

No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Fuel Price is", p_val[0,1])
The Z value for Store Size vs Fuel Price is 0.726107704446794
In [47]:
xi = size
yi = temp

No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Temperature is", p_val[0,1])
The Z value for Store Size vs Temperature is -7.463314421009281

Hypothesis Test: (Temperature vs Other Independent Variables)


  • Null Hypothesis H0: Temperature does NOT correlate with the other features (i.e. Fuel_Price, CPI).
  • Alt. Hypothesis Ha: Temperature correlates with the other features.

The tests below yield Z = 11.7 for Temperature vs Fuel Price and Z = 14.3 for Temperature vs CPI, both well above 3, so we reject the Null Hypothesis and conclude that Temperature correlates with both Fuel Price and CPI. This aligns with the correlation values in the heatmap above.

In [48]:
xi = fp
yi = temp

No_bootstrap_trial = len(temp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs Fuel Price is", p_val[0,1])
The Z value for Temperature vs Fuel Price is 11.676566479392918
In [49]:
xi = cpi
yi = temp

No_bootstrap_trial = len(temp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs CPI is", p_val[0,1])
The Z value for Temperature vs CPI is 14.266627971110935

Hypothesis Test: (Fuel Price vs CPI)


  • Null Hypothesis H0: Fuel Price does NOT correlate with CPI.
  • Alt. Hypothesis Ha: Fuel Price correlates with CPI.

The test below yields Z = -13.6 for Fuel Price vs CPI, i.e. |Z| > 3, so we reject the Null Hypothesis and conclude that Fuel Price correlates with CPI. This aligns with the correlation values in the heatmap above.

In [50]:
xi = cpi
yi = fp

No_bootstrap_trial = len(fp)
coef_lst=[]
for i in range(No_bootstrap_trial):
    x  = np.random.permutation(xi)
    y  = np.random.permutation(yi)
    coef_lst.append(np.corrcoef(x,y)[0,1])

org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Fuel Price vs CPI is", p_val[0,1])
The Z value for Fuel Price vs CPI is -13.646755720403766

Hypothesis Summary


The hypothesis tests above give us some insight into how the independent features interact with one another and how they correlate with the dependent variable. The feature most correlated with Weekly Sales (the dependent variable) is Store Size; the remaining independent features are comparatively less correlated. These insights inform the machine-learning stage: based on this analysis, we can iterate over models with different feature combinations in order to find the most efficient, best-performing model.

Findings from Data Analysis


  • NOT all the predictor variables are informative: MarkDown1 - MarkDown5 contain a huge number of missing values and are dropped from the data analysis and the subsequent predictive modeling.

  • Most instances are retained even when they contain outlier values, because those points may reflect seasonal patterns in the target variable (Weekly Sales), e.g. store sales are consistently higher during holiday seasons.

  • Correlations: most features, except Fuel_Price, correlate with the target variable Store Weekly Sales.

  • Correlations: the predictor variables are mostly correlated with each other, except Store Size vs Fuel Price and Store Size vs CPI.
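The first finding (dropping the MarkDown columns for missingness) can be sketched as a missing-ratio check. This is an illustrative reconstruction on a toy frame, not the notebook's actual cleaning code, and the 0.5 threshold is an assumption:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the merged training data (values are illustrative)
df = pd.DataFrame({
    "Weekly_Sales": [24924.50, 50605.27, 13740.12, 39954.04],
    "MarkDown1": [np.nan, np.nan, np.nan, 6766.44],
    "MarkDown2": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column; columns above the threshold get dropped
missing_ratio = df.isna().mean()
to_drop = missing_ratio[missing_ratio > 0.5].index.tolist()
print(missing_ratio)
print("dropping:", to_drop)   # both MarkDown columns exceed the threshold
```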

Modeling and Forecast


  • Use the analyzed features above to build predictive models that forecast Weekly Sales (continuous) for a single store or department
  • Grid-search for the best model based on various measurement metrics (e.g. MSE).
  • Proposed methodologies for modeling:
    • Single regressor: Linear Regression, k-nearest neighbors, decision tree
    • Ensemble regressor: Random Forest Regressor, XGB Regressor
    • Deep learning: LSTM, Conv1D, GRU

Machine Learning

In [1]:
import pandas as pd
import numpy as np

trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")


df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)

#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
       'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
       'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']

display(train.head(3))
Dataset size: (421570, 16)
Store Dept Date Weekly_Sales IsHoliday Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Store_Type Size
0 1 1 2010-02-05 24924.50 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
1 1 2 2010-02-05 50605.27 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
2 1 3 2010-02-05 13740.12 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315

Dummy Categorical Variables

In [4]:
#Copy the original dataset and drop the sparse MarkDown columns
train2 = train.copy()
train2 = train2.drop(["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"], axis=1)

#Dummy out the categorical variables
train2 = pd.concat((train2, pd.get_dummies(train2["IsHoliday"], prefix="IsHoliday")), axis=1)
train2 = pd.concat((train2, pd.get_dummies(train2["Store_Type"], prefix="Store_Type")), axis=1)
train2.drop(["IsHoliday", "Store_Type"], axis=1, inplace=True)
train2.columns
train2.head(3)
Out[4]:
Store Dept Date Weekly_Sales Temperature Fuel_Price CPI Unemployment Size IsHoliday_False IsHoliday_True Store_Type_A Store_Type_B Store_Type_C
0 1 1 2010-02-05 24924.50 42.31 2.572 211.096358 8.106 151315 1 0 1 0 0
1 1 2 2010-02-05 50605.27 42.31 2.572 211.096358 8.106 151315 1 0 1 0 0
2 1 3 2010-02-05 13740.12 42.31 2.572 211.096358 8.106 151315 1 0 1 0 0
In [135]:
# load additional module
import pickle

storeLst=[]
for store_no in sorted(list(set(train2.Store))): 
    for date in sorted(list(set(train2.Date))):
        # chained indexing triggers the UserWarning below; a single mask,
        # train2[(train2.Store == store_no) & (train2.Date == date)], avoids it
        item = train2[train2.Store == store_no][train2.Date == date]
        storeLst.append(list([store_no,
                      date,
                      round(list(set(item.Temperature))[0],2),
                      round(list(set(item.Fuel_Price))[0],3),
                      round(list(set(item.CPI))[0],5),              
                      round(list(set(item.Unemployment))[0],5),
                      list(set(item.Size))[0],
                      list(set(item.IsHoliday_False))[0],
                      list(set(item.IsHoliday_True))[0],              
                      list(set(item.Store_Type_A))[0], 
                      list(set(item.Store_Type_B))[0], 
                      list(set(item.Store_Type_C))[0], 
                      round(np.divide(item.Weekly_Sales.sum(), 1000000),5)]))
        
with open('train2.data', 'wb') as filehandle:  
    # store the data as binary data stream
    pickle.dump(storeLst, filehandle)        
/Users/KevQuant/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  import sys

Load aggregated data

In [1]:
import pandas as pd
import numpy as np
import pickle
with open('train2.data', 'rb') as filehandle:  
    # read the data as binary data stream
    storeLst = pickle.load(filehandle)
    
train3=pd.DataFrame(storeLst, columns=["store_no", "date","temp","fuel_price","cpi", "Unemploy","store_size",
                                "holiday_f","holiday_t", "type_A", "type_B", "type_C","wkly_sales" ])
train3.head(3)
Out[1]:
store_no date temp fuel_price cpi Unemploy store_size holiday_f holiday_t type_A type_B type_C wkly_sales
0 1 2010-02-05 42.31 2.572 211.09636 8.106 151315 1 0 1 0 0 1.64369
1 1 2010-02-12 38.51 2.548 211.24217 8.106 151315 0 1 1 0 0 1.64196
2 1 2010-02-19 39.93 2.514 211.28914 8.106 151315 1 0 1 0 0 1.61197

Split the date column into year, month, day columns

In [2]:
new=train3.date.str.split("-",n=2,expand=True)
train3["year"] = new[0]
train3["month"] = new[1]
train3["day"] =new[2]
train3.head()
Out[2]:
store_no date temp fuel_price cpi Unemploy store_size holiday_f holiday_t type_A type_B type_C wkly_sales year month day
0 1 2010-02-05 42.31 2.572 211.09636 8.106 151315 1 0 1 0 0 1.64369 2010 02 05
1 1 2010-02-12 38.51 2.548 211.24217 8.106 151315 0 1 1 0 0 1.64196 2010 02 12
2 1 2010-02-19 39.93 2.514 211.28914 8.106 151315 1 0 1 0 0 1.61197 2010 02 19
3 1 2010-02-26 46.63 2.561 211.31964 8.106 151315 1 0 1 0 0 1.40973 2010 02 26
4 1 2010-03-05 46.50 2.625 211.35014 8.106 151315 1 0 1 0 0 1.55481 2010 03 05

Split Train and Test set

In [4]:
train4=train3[train3.date <= "2011-11-18"]
test4=train3[train3.date > "2011-11-18"]

Function to convert the dataset into sliding windows (i.e. timesteps)

In [5]:
from sklearn import preprocessing

#data is pandas dataframe
def generate_x_y(data,timestep, normalize=True):
    min_max_scaler = preprocessing.MinMaxScaler()
    ydata=[]
    date_lst=[]
    store_lst=[]
    lst=[]
    for store_no in sorted(list(set(data.store_no))):
       # print(store_no)
        ydata.append(data["wkly_sales"][data.store_no==store_no].iloc[timestep-1:])
        date_lst.append(data["date"][data.store_no==store_no].iloc[timestep-1:])
        store_lst.append(data["store_no"][data.store_no==store_no].iloc[timestep-1:])
        arr =np.array(data[data.store_no==store_no])
        reshape_length = timestep * arr.shape[1]
        for i in range(arr.shape[0]-timestep+1):
            s=arr[i:i+timestep]
            lst.append(s.reshape(reshape_length))


    df=pd.DataFrame(lst , columns=list(data.columns)*timestep)
    xdata=df.drop(["store_no", "date", "wkly_sales"], axis=1)
    ydata=np.array(ydata).flatten()
    date_lst=np.array(date_lst).flatten()
    store_lst=np.array(store_lst).flatten()
    
    # Normalize X data
    if normalize == True:
        normalized_xdata=min_max_scaler.fit_transform(xdata)
        return(normalized_xdata,np.array(ydata), date_lst, store_lst)
    else:
        return(np.array(xdata),np.array(ydata), date_lst, store_lst)
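The core of `generate_x_y` is the window-and-flatten step: each sample concatenates `timestep` consecutive rows of a store's history into one feature vector. A minimal NumPy sketch of just that step on a toy array (the numbers are illustrative, not Walmart data):

```python
import numpy as np

# Toy series: 5 weeks, 3 features per week
arr = np.arange(15).reshape(5, 3)
timestep = 2

# Mirrors the reshape inside generate_x_y: each window of `timestep`
# consecutive rows is flattened into a single sample
windows = np.array([arr[i:i + timestep].reshape(timestep * arr.shape[1])
                    for i in range(arr.shape[0] - timestep + 1)])
print(windows.shape)   # (4, 6): 5 - 2 + 1 samples, each 2 * 3 features
print(windows[0])      # [0 1 2 3 4 5]: weeks 0 and 1 concatenated
```

With 5 rows and `timestep=2` this yields 4 samples, which is why the function also trims the first `timestep-1` target values per store.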

Function to Plot Train vs Test Score

In [143]:
def scorePlot(window_slide_lst, train_score_lst, test_score_lst, modelName):
    df= pd.DataFrame(list(zip(window_slide_lst,
                              train_score_lst,
                              test_score_lst
                              )), columns=["timestep","train_score", "test_score"
                                                          ])
    df=df.set_index("timestep")


    df.plot(use_index=True, figsize=(10,5))
    plt.title(modelName)
    plt.ylabel ("Score")
    plt.xlabel("Timestep / Window_Slide")
    plt.xticks(window_slide_lst)
    plt.show()

Build model

In [19]:
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import median_absolute_error, mean_squared_error, mean_squared_log_error

from sklearn.linear_model import LinearRegression
In [156]:
train_reg_score_Lst=[]
test_reg_score_Lst=[]
test_reg_mse_Lst=[]
window_slide_Lst =np.arange(2,10)

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)

    reg = LinearRegression().fit(xtrain, ytrain)
    a=reg.score(xtrain, ytrain)
    b=reg.score(xtest, ytest)
    c=mean_squared_error(ytest,reg.predict(xtest) )
    
    train_reg_score_Lst.append(a)
    test_reg_score_Lst.append(b)
    test_reg_mse_Lst.append(c)
In [157]:
scorePlot(window_slide_Lst, train_reg_score_Lst, test_reg_score_Lst, "Linear Regression")
In [124]:
window_slide_Lst =np.arange(2,10)
In [125]:
from xgboost import XGBRegressor 

train_xgb_score_Lst=[]
test_xgb_score_Lst=[]
test_xgb_mse_Lst=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)

    xgb = XGBRegressor()
    xgb.fit(xtrain, ytrain)
    a=xgb.score(xtrain, ytrain)
    b=xgb.score(xtest, ytest)
    c=mean_squared_error(ytest,xgb.predict(xtest) )
    
    #print(xgb)
    #print("Train Score:",a)
    #print("Test Score",b)
    #print("mean_squared_error", c)
    
    train_xgb_score_Lst.append(a)
    test_xgb_score_Lst.append(b)
    test_xgb_mse_Lst.append(c)
In [145]:
scorePlot(window_slide_Lst, train_xgb_score_Lst, test_xgb_score_Lst, "XGB Regressor")
In [127]:
from sklearn.linear_model import LassoCV, RidgeCV

train_ridge_score_Lst=[]
test_ridge_score_Lst=[]
test_ridge_mse_Lst=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)

    ridge = RidgeCV(cv=5, alphas=(0.001,0.01,0.1, 1.0, 10.0))
    ridge.fit(xtrain, ytrain)
    a=ridge.score(xtrain, ytrain)
    b=ridge.score(xtest, ytest)
    c=mean_squared_error(ytest,ridge.predict(xtest) )
    
    #print(ridge)
    #print("Train Score:",a)
    #print("Test Score",b)
    #print("mean_squared_error", c)
    
    train_ridge_score_Lst.append(a)
    test_ridge_score_Lst.append(b)
    test_ridge_mse_Lst.append(c)
In [146]:
scorePlot(window_slide_Lst, train_ridge_score_Lst, test_ridge_score_Lst, "Ridge")
In [129]:
from sklearn.linear_model import LassoCV, RidgeCV

train_lasso_score_Lst=[]
test_lasso_score_Lst=[]
test_lasso_mse_Lst=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)

    lasso = LassoCV(cv=5, alphas=(0.001,0.01,0.1, 1.0, 10.0))
    lasso.fit(xtrain, ytrain)
    a=lasso.score(xtrain, ytrain)
    b=lasso.score(xtest, ytest)
    c=mean_squared_error(ytest,lasso.predict(xtest) )
    
    #print(lasso)
    #print("Train Score:",a)
    #print("Test Score",b)
    #print("mean_squared_error", c)
    
    train_lasso_score_Lst.append(a)
    test_lasso_score_Lst.append(b)
    test_lasso_mse_Lst.append(c)
In [147]:
scorePlot(window_slide_Lst, train_lasso_score_Lst, test_lasso_score_Lst, "Lasso")
In [131]:
from sklearn import linear_model

train_sgd_score_Lst=[]
test_sgd_score_Lst=[]
test_sgd_mse_Lst=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)

    sgd = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.00000001,max_iter=10000)
    sgd.fit(xtrain,ytrain)
    a=sgd.score(xtrain, ytrain)
    b=sgd.score(xtest, ytest)
    c=mean_squared_error(ytest,sgd.predict(xtest) )
    
    #print(sgd)
    #print("Train Score:",a)
    #print("Test Score",b)
    #print("mean_squared_error", c)
    
    train_sgd_score_Lst.append(a)
    test_sgd_score_Lst.append(b)
    test_sgd_mse_Lst.append(c)
In [148]:
scorePlot(window_slide_Lst, train_sgd_score_Lst, test_sgd_score_Lst, "SGD Regressor")
In [133]:
from sklearn import linear_model

train_bay_score_Lst=[]
test_bay_score_Lst=[]
test_bay_mse_Lst=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)

    bay = linear_model.BayesianRidge()
    bay.fit(xtrain, ytrain)
    a=bay.score(xtrain, ytrain)
    b=bay.score(xtest, ytest)
    c=mean_squared_error(ytest,bay.predict(xtest) )
    
    #print(bay)
    #print("Train Score:",a)
    #print("Test Score",b)
    #print("mean_squared_error", c)
    
    train_bay_score_Lst.append(a)
    test_bay_score_Lst.append(b)
    test_bay_mse_Lst.append(c)
In [149]:
scorePlot(window_slide_Lst, train_bay_score_Lst, test_bay_score_Lst, "Bayesian Regressor")
In [135]:
from sklearn.ensemble import RandomForestRegressor

train_rfr_score_Lst=[]
test_rfr_score_Lst=[]
test_rfr_mse_Lst=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)

    
    rfr= RandomForestRegressor(n_estimators=200)
    rfr.fit(xtrain, ytrain)
    a=rfr.score(xtrain, ytrain)
    b=rfr.score(xtest, ytest)
    c=mean_squared_error(ytest,rfr.predict(xtest) )
    
    train_rfr_score_Lst.append(a)
    test_rfr_score_Lst.append(b)
    test_rfr_mse_Lst.append(c)
In [150]:
scorePlot(window_slide_Lst, train_rfr_score_Lst, test_rfr_score_Lst, "Random Forest Regressor")
In [137]:
from sklearn.ensemble import GradientBoostingRegressor

train_gbr_score_Lst=[]
test_gbr_score_Lst=[]
test_gbr_mse_Lst=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)

    
    gbr= GradientBoostingRegressor(n_estimators=200)
    gbr.fit(xtrain, ytrain)
    a=gbr.score(xtrain, ytrain)
    b=gbr.score(xtest, ytest)
    c=mean_squared_error(ytest,gbr.predict(xtest) )

    train_gbr_score_Lst.append(a)
    test_gbr_score_Lst.append(b)
    test_gbr_mse_Lst.append(c)
In [151]:
scorePlot(window_slide_Lst, train_gbr_score_Lst, test_gbr_score_Lst, "Gradient Boosting Regressor")
In [139]:
from sklearn.ensemble import AdaBoostRegressor

train_abr_score_Lst=[]
test_abr_score_Lst=[]
test_abr_mse_Lst=[]

#Iterate Timestep/ Window_slide
for window_slide in window_slide_Lst :
    xtrain,ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest,ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)


    abr=AdaBoostRegressor(n_estimators=400, learning_rate=0.001)
    abr.fit(xtrain, ytrain)
    a=abr.score(xtrain, ytrain)
    b=abr.score(xtest, ytest)
    c=mean_squared_error(ytest,abr.predict(xtest) )
    
    train_abr_score_Lst.append(a)
    test_abr_score_Lst.append(b)
    test_abr_mse_Lst.append(c)
In [152]:
scorePlot(window_slide_Lst, train_abr_score_Lst, test_abr_score_Lst, "AdaBoost Regressor")

MSE Comparison Plot

In [155]:
df= pd.DataFrame(list(zip(window_slide_Lst,
                          test_reg_mse_Lst,
                          test_xgb_mse_Lst,
                          test_ridge_mse_Lst,
                          test_lasso_mse_Lst,
                          test_sgd_mse_Lst,
                          test_bay_mse_Lst,
                          test_rfr_mse_Lst,
                          test_gbr_mse_Lst,
                          test_abr_mse_Lst)), columns=["timestep",
                                                       "LR", "xgb", "LR-Ridge", "LR-Lasso", 
                                                       "SGD Regressor", 
                                                       "Bayesian Regressor",
                                                       "Random Forest Regressor",
                                                       "Gradient Boosting Regressor","ada Boost Regressor"
                                                      ])
df=df.set_index("timestep")


df.plot(use_index=True, figsize=(10,5))
plt.title("MSE Comparison")
plt.ylabel("MSE")
plt.xlabel("Timestep")
plt.xticks(window_slide_Lst)
plt.show()
Based on the plots above, the best three models are the XGB Regressor, Gradient Boosting Regressor, and Random Forest Regressor. We will therefore continue with a grid search over these models, which should yield a further-optimized result by tuning hyperparameters (e.g. the number of estimators).
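The planned grid search can be sketched as follows. This is a minimal illustration under assumed settings: the parameter grid is hypothetical, synthetic data stands in for the windowed features, and `TimeSeriesSplit` is my suggestion for keeping folds chronological in this weekly-sales setting:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for (xtrain, ytrain) from generate_x_y
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1

# Hypothetical grid; a real run would sweep wider ranges
param_grid = {"n_estimators": [100, 200], "max_depth": [2, 3]}

# TimeSeriesSplit trains on earlier folds and validates on later ones,
# avoiding leakage from the future into the past
gs = GridSearchCV(GradientBoostingRegressor(random_state=0),
                  param_grid,
                  scoring="neg_mean_squared_error",
                  cv=TimeSeriesSplit(n_splits=3))
gs.fit(X, y)
print(gs.best_params_)
```

The same pattern applies to `XGBRegressor` and `RandomForestRegressor` by swapping the estimator and grid.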